
13 research outputs found

EVEREST: automatic identification and classification of protein domains in all protein sequences

Author: A Bairoch
A Barak
A Bateman
A Heger
Amir Harel
B Boeckmann
CH Wu
E Portugaly
E Portugaly
Elon Portugaly
F Servant
HM Berman
J Gracy
J Gracy
J Liu
J Liu
J Park
J Schultz
JD Thompson
JM Chandonia
Michal Linial
N Kaplan
N Nagarajan
Nathan Linial
NJ Mulder
O Dekel
O Sasson
O Sasson
O Shachar
SF Altschul
SR Eddy
TF Smith
TJ Hubbard
Y Inbar
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Author: Ashburner
Bridges
D'haeseleer
E. Portugaly
Finn
Fitch
Kaplan
Kaplan
Liu
M. Fromer
M. Linial
Mulder
Murzin
Shachar
Sneath
Tatusov
Y. Loewenstein
Publication venue: Oxford University Press

Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets.

Crossref

PubMed Central

Analysis and comparison of very large metagenomes with fast clustering and functional annotation

Author: AC McHardy
AR Quinlan
B Rodriguez-Brito
D Sheskin
DB Rusch
DC Richter
DH Huson
E Portugaly
EA Dinsdale
EF DeLong
FE Angly
GW Tyson
H Noguchi
H Noguchi
H Teeling
H Teeling
J Shendure
JC Venter
K Mavromatis
KJ Hoff
L Krause
PD Schloss
R Seshadri
RK Aziz
S Yooseph
S Yooseph
SF Altschul
SG Tringe
SR Eddy
SR Gill
W Li
W Li
W Li
W Li
Weizhong Li
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The remarkable advance of metagenomics presents significant new challenges in data analysis. Metagenomic datasets (metagenomes) are large collections of sequencing reads from anonymous species within particular environments. Computational analyses for very large metagenomes are extremely time-consuming, and there are often many novel sequences in these metagenomes that are not fully utilized. The number of available metagenomes is rapidly increasing, so fast and efficient metagenome comparison methods are in great demand. Results: The new metagenomic data analysis method Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) was developed using an ultra-fast sequence clustering algorithm, fast protein family annotation tools, and a novel statistical metagenome comparison method that employs a unique graphic interface. RAMMCAP processes extremely large datasets with only moderate computational effort. It identifies raw read clusters and protein clusters that may include novel gene families, and compares metagenomes using clusters or functional annotations calculated by RAMMCAP. In this study, RAMMCAP was applied to the two largest available metagenomic collections, the "Global Ocean Sampling" and the "Metagenomic Profiling of Nine Biomes". Conclusion: RAMMCAP is a very fast method that can cluster and annotate one million metagenomic reads in only hundreds of CPU hours. It is available from http://tools.camera.calit2.net/camera/rammcap/.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

More Than 1,001 Problems with Protein Domain Databases: Transmembrane Regions, Signal Peptides and the Issue of Sequence Homology

Author: A Andreeva
A Bahr
A Bateman
A Bateman
A Bernsel
A Kihara
A Klug
A Marchler-Bauer
A Stojmirovic
AA Schaffer
AA Schaffer
AE Todd
AG Murzin
AL Cuff
AM Schnoes
AM Settles
B Eisenhaber
B Eisenhaber
B Eisenhaber
B Eisenhaber
B Scheres
C Bru
C Sander
C Xu
CA Ouzounis
CH Wu
CP Ponting
CP Ponting
CP Ponting
D Devos
D Ivanov
D Wilson
DA Uwanogho
DE de Oliveira
DL Burgess
E Portugaly
EL Sonnhammer
EL Sonnhammer
F Eisenhaber
F Eisenhaber
F Eisenhaber
Frank Eisenhaber
G Schneider
GC Clark
GE Tusnady
H Ashida
H Johansson
H Mi
H Nielsen
HS Ooi
I Letunic
IL Alberts
J Abendroth
J Gough
J Kota
J Ren
J Schultz
J Schultz
JC McNulty
JC Pizarro
JC Wootton
JD Bendtsen
JD Selengut
JG Henikoff
JH Weiner
JH Zar
JI Shin
JK Tie
L Aravind
L Kall
L Kall
L Sun
L Zhang
LF Ciufo
LJ Smith
M Cserzo
M Cserzo
M Fukuda
M Gruber
M Hedman
M Ikeda
MH Saier Jr
MR Yen
N Hulo
N Kageyama-Yahara
O Leon
P Bork
P Bork
P Bork
P Tompa
P Tompa
PH Krebsbach
Philip E. Bourne
R Albrecht
R Durbin
R Janssen
R Watanabe
RD Finn
RF Doolittle
RR Copley
RW Hooft
S Henikoff
S Iuchi
S Ohnishi
S Veretnik
SA Weston
Sebastian Maurer-Stroh
SF Altschul
SF Altschul
SJ Sammut
SR Eddy
SR Eddy
SS Krishna
T Nakai
TA Holland
TK Attwood
V Anantharaman
V Brendel
VV Lunin
W Li
W Verelst
Wing-Cheong Wong
WR Gilks
WR Gilks
Publication venue: Public Library of Science
Publication date: 01/01/2010

Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

ScholarBank@NUS